Non-record: BitNet Ternary — 65M params in 15.9MB (1.1932 BPB) by chrislovescoding · Pull Request #666 · openai/parameter-golf

chrislovescoding · 2026-03-25T01:12:44Z

Summary

Ternary weight quantization ({-1, 0, +1} at ~1.58 bits/weight) enables fitting 65M parameters in a 15.9MB artifact — 3x the parameter count of standard int6 submissions (~22M params) at similar artifact size.

This explores a fundamentally different optimization axis: instead of aggressively quantizing a small model, we train a much larger model with extreme quantization from the start. No other submission in this competition has attempted ternary/BitNet-style training.

Key Results

Metric	Value
Model params	64,529,040
Artifact size	15,878,267 bytes
Post-quant val_bpb	1.2271
Sliding window val_bpb (stride=64)	1.1932
Quantization gap	0.0003 BPB (near-zero)
Steps	5,026 in 600s on 8xH100

Approach

Architecture: 12 layers, 768 dim, 12/6 GQA heads, 3x MLP (hidden=2304), LeakyReLU(0.5)-squared, U-Net skip connections
Ternary STE: Full-precision weights maintained by Muon optimizer; forward pass quantizes to {-1, 0, +1} via Straight-Through Estimator with per-row mean-absolute scaling
Activation schedule: Full-precision for first 30% of wallclock, ternary STE for remaining 70%
Compression: Ternary int8 values + zlib-9 (zstd-22 would save ~1MB further)
Eval: Sliding window stride=64

Key Findings

Ternary training works at 65M scale in 10 minutes — loss fully recovers after the ternary transition (spike from 1.30 to 1.36 BPB, then recovery to 1.23)
Near-zero quantization gap (0.0003 BPB) because the model is trained with ternary STE
3x more parameters fit in the same budget vs int6 — opens a new frontier for parameter-constrained LMs
The ternary approach is orthogonal to and composable with other competition techniques (GPTQ, XSA, EMA, etc.)

Test Plan

Runs in <10 minutes on 8xH100 SXM
Artifact under 16MB (15.9MB)
Reproducible (single seed, deterministic)
Sliding window eval completes within eval budget

5 unique blocks × 3 loops = 15 effective layers, dim=640, SwiGLU MLP, 10/5 GQA heads, loop embeddings, QAT with STE, gradient clipping. ~16.6M params, estimated artifact ~15.5MB. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

enable_gqa flag not supported on Ampere. Manually expand KV heads and enable fallback SDP backends. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Auto-detects Windows and disables compile. Can override with USE_COMPILE=0/1 env var. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Drag-drop or paste log files to visualize loss curves, val BPB, step timing, artifact size, and multi-run comparison. Dark theme. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Fills 16MB budget (15.4MB est, was 11MB). 23.4M params, 18 effective layers. 8 heads (hd=88), 4 KV heads. ~3,340 steps on 8xH100. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Sliding window eval (stride=256 default): overlapping windows give every scored token ~768 tokens of context. Free ~0.03 BPB improvement. FP16 embedding: keeps tok_emb in fp16 instead of int8, avoids quantization quality loss on the most sensitive tensor. Defaults back to v1 config (5 blocks, dim=640). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

4x longer context during training improves predictions and BPB. Batch tokens reduced to 393K to fit memory with longer sequences. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Baseline 9-layer 512-dim architecture with all proven wins stacked: - seq4096 training (4x context) - Sliding window eval stride=64 (~0.03 BPB free) - 3x MLP expansion (hidden=1536) - Muon tuning (momentum=0.99, LR=0.02, warmdown=3000) - FP16 embedding in quantization - QAT with STE (near-zero quant gap) - Manual KV repeat for 3090 compat - torch.compile skip on Windows Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Int6 per-row quantization (QUANT_RANGE=31) + zstd-22 compression fits MLP 3x in 16MB. seq1024 for max steps (~12K on 8xH100). Sliding window stride=64. Muon 0.99, LR=0.02, warmdown=3000. FP16 embedding. No QAT (overhead not worth it per PR openai#76). Targets ~1.16 BPB matching top submissions. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Everything from v3 plus: - Int6 STE QAT: fake quantization at QUANT_RANGE=31 during second half of training. Closes ~0.05 BPB quant gap to ~0.001. - SWA: averages 7 checkpoints during warmdown for better generalization. Targets ~1.16 BPB on 8xH100, competitive with top submissions. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Use uncompiled base_model for per_token sliding window eval. torch.compile fullgraph can't handle per_token arg changing. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Fine-tunes the dequantized model on val data during the 10-min eval budget. Up to 30 epochs at lr=0.0005 with 480s time cap. The model adapts to the val distribution before sliding window scoring. Combined with int6+MLP3x+sliding window, targets sub-1.0 BPB. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

torch.compile artifacts on base_model caused crashes during TTT. Build a new clean GPT instance, load dequantized weights, then fine-tune. Sliding window eval also uses the TTT-adapted model. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Creative approach: blend transformer predictions with PPM (Prediction by Partial Matching) at eval time. PPM costs zero artifact bytes — builds itself from eval data. Bridges 1990s compression with 2026 neural. Also upgrades base: 11 layers, EMA (replaces SWA), LeakyReLU(0.5)^2. Keeps int6 quant, sliding window, Muon tuning, QAT, TTT. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

EMA weights haven't been through QAT, so they quantize terribly (0.18 BPB gap). When QAT is enabled, use the QAT-trained weights directly. EMA is only loaded when QAT is disabled. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Replaced per-token Python loop with vectorized numpy operations. np.add.at for counting, matrix ops for smoothing. 200x faster. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Ternary weights {-1,0,1} at ~1.58 bits/weight enable 3x more params. 12 layers, 768 dim, 3x MLP, 65M params fit in ~14MB after zstd. TernaryLinear with STE for training, custom ternary quantization. Includes sliding window eval + PPM hybrid blend. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Ternary weights {-1,0,1} at ~1.58 bits/weight enable 3x more parameters (65M vs ~22M for int6) in the 16MB artifact budget. Trained with STE, near-zero quantization gap (0.0003 BPB). 12 layers, 768 dim, 3x MLP. Sliding window val_bpb: 1.1932 (stride=64) Post-quant val_bpb: 1.2271 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

chrislovescoding and others added 19 commits March 18, 2026 23:17

Add setup.sh for venv + dependencies

e58cdc2

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Fix GQA compatibility for 3090 (manual KV head repeat)

b580d94

enable_gqa flag not supported on Ampere. Manually expand KV heads and enable fallback SDP backends. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Skip torch.compile on Windows (no Triton support)

0c40991

Auto-detects Windows and disables compile. Can override with USE_COMPILE=0/1 env var. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Add training dashboard (single HTML file)

b79a3a2

Drag-drop or paste log files to visualize loss curves, val BPB, step timing, artifact size, and multi-run comparison. Dark theme. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Scale up to Config D: 6 blocks, dim=704, hidden=1088

4b7e107

Fills 16MB budget (15.4MB est, was 11MB). 23.4M params, 18 effective layers. 8 heads (hd=88), 4 KV heads. ~3,340 steps on 8xH100. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Default to seq_len=4096 training for better context modeling

78397d6

4x longer context during training improves predictions and BPB. Batch tokens reduced to 393K to fit memory with longer sequences. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Fix sliding window eval crash with torch.compile

6fd2e7c

Use uncompiled base_model for per_token sliding window eval. torch.compile fullgraph can't handle per_token arg changing. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Optimize PPM: fully vectorized numpy bigram model (~3s for 62M tokens)

c00ee9f

Replaced per-token Python loop with vectorized numpy operations. np.add.at for counting, matrix ops for smoothing. 200x faster. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Non-record: BitNet Ternary — 65M params in 15.9MB (1.1932 BPB)#666

Non-record: BitNet Ternary — 65M params in 15.9MB (1.1932 BPB)#666
chrislovescoding wants to merge 19 commits intoopenai:mainfrom
chrislovescoding:bitnet-ternary-65m

chrislovescoding commented Mar 25, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

chrislovescoding commented Mar 25, 2026

Summary

Key Results

Approach

Key Findings

Test Plan

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant